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propose^peveral  implementations  of  Gaussian  elimination  for  solving 
banded  linear  systems  on  multiprocessors.  Three  simplejlarchitectures  are  considered:  a  multipro¬ 
cessor  ring,  a  grid  array  and  a  bypercube.  Oar  complexity  analysis  fully  accounts  for  communication 
delays  by  using  simple^models  where  both  latency  and  actual  transfer  times  are  incorporated.  When 
the  number  of  processors  is  small  relative  to  the  bandwidth  of  the  system  a  row  interleaved  imple¬ 
mentation  of  Gaussian  elimination  algorithm  is  attractive.  Otherwise,  a  two-dimensional  grid  is 
essential  for  achieving  higher  speed-up.  The  hypercube  architecture  gives  the  smallest  communi¬ 


cation  latency  times. 
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1.  Introduction 

Symmetric,  positive  definite,  banded  linear  systems  constitute  one  of  the  most  important 
classes  of  linear  systems  encountered  in  scientific  computing.  In  this  paper  we  propose  and  analyze 
several  implementations  of  Gauss  elimination  for  dealing  with  such  systems  on  parallel  computers. 
The  implementations  are  developed  and  compared  for  three  simple  multiprocessor  architectures, 
namely  a  ring  of  processors,  cf.,  Figure  1,  a  rectangular  two-dimensional  grid  of  processors,  cf., 
Figure  2,  and  a  hypercube  of  processors,  cf.,  [13]. 

The  banded  linear  systems  treated  in  scientific  computing  are  often  very  large  with  large 
bandwidths.  For  problems  arising  from  the  discretization  of  elliptic  partial  differential  equations 
in  two  variables,  the  bandwidth  of  the  matrix  is  usually  of  the  order  of  y/N  if  the  system  is  of 
order  N.  For  this  reason  we  will  be  interested  in  cases  where  the  number  of  processors  is  smaller 
than  the  bandwidth  of  A,  and  in  the  less  restrictive  cases  where  it  is  smaller  than  the  square  of  the 
bandwidth.  In  either  of  these  two  cases  there  is  inherent  parallelism  in  the  Gaussian  elimination 
algorithm  that  can  be  exploited  and  our  focus  will  be  on  developing  effective  implementations  of 
this  simple  algorithm.  We  will  not  consider  the  case  where  the  number  of  processors  is  very  large 
compared  to  the  bandwidth,  such  as  for  example  when  the  matrix  is  tridiagonal.  This  case  can  be 
treated  by  special  techniques  such  as  cyclic  reduction  [4]  or  a  substructured  banded  eliminination 
[2,  11].  See  [5]  for  a  general  discussion  of  this  case. 

The  main  assumptions  on  the  architecture  are  that  each  processor  has  its  own  memory  and 
holds  its  share  of  the  data.  Data  is  moved  from  one  processor  by  message  passing  using  the  local 
links  of  the  array.  No  data  transfer  is  made  via  shared  global  memory,  or  via  a  bus. 

In  looking  for  effective  parallel  algorithms  we  are  guided  by  both  arithmetic  efficiency  and 
communication  efficiency.  The  timing  models  used  for  analyzing  complexity  assume  that  communi¬ 
cation  time  is  nonnegligible  as  compared  with  arithmetic.  Moreover,  for  generality,  a  start-up  time 
is  incorporated  whenever  estimating  the  time  for  performing  either  a  data  communication  task  or 
an  arithmetic  task.  We  assume  for  the  sake  of  simplicity  of  our  analysis  that  communication  and 
arithmetic  are  not  overlapped. 

As  will  be  seen,  an  important  limiting  factor  in  nearest  neighbor  arrays  is  the  start-up  in 
communication.  Intuitively,  when  the  number  of  processors  increases,  the  packets  of  data  which 
are  spread  to  a  larger  number  of  processors  must  travel  distances  that  become  larger  and  larger 
while  the  packets  become  smaller.  The  result  is  an  inevitable  increase  in  data  movement  start-ups 
and  hence  a  serious  obstacle  to  speed-up  for  very  large  number  of  processors.  We  are  then  lead 
to  the  following  paradoxical  question:  given  an  arbitrarily  large  number  of  processors  is  it  best  to 
use  all  of  the  available  processors  or  to  “turn-off"  a  few  of  them  and  use  only  part  of  the  available 
computing  power?  Clearly,  the  question  relates  to  one  fixed  algorithm  otherwise  an  easy  answer 
would  be  to  switch  to  a  different  method  (if  one  exists)  that  will  take  better  advantage  of  the  large 
number  of  processors. 

We  will  see,  for  example,  that  for  a  class  of  implementations  of  Gaussian  elimination  on 
multiprocessor  rings  or  grids  the  best  computing  time  is  sot  realised  for  the  maximum  allowable 
number  of  processors  when  N  is  large.  Thus,  the  best  time  that  can  be  achieved  on  a  grid  of 
processors  is  of  the  order  of  0{v*t*N)  where  v  is  the  half- bandwidth  of  the  matrix.  These  algorithms 
can  be  termed  lock-step  algorithms,  as  they  consist  of  executing  the  outer  loops  of  Gauss'  algorithm 
in  succession  without  overlapping  them.  A  different  approach,  the  wavefront  implementation,  will 
also  be  described  and  compared  with  the  simpler  lock-step  algorithms  Out  compariaoa  shows  that 
while  the  wavefront  approach  may  be  superior  when  the  number  cf  processor*  m  large,  this  is  uot 
true  for  small  number  of  processors. 

We  will  compare  the  performance  of  the  Gaussian  elimination  algorithms  os  the  three  different 
architectures.  In  the  case  of  a  small  number  of  processors,  we  mat  elect  to  map  the  ring  into  the 


Figure  1:  An  8-processor  ring. 


grid  or  hypercube  networks  and  use  an  algorithm  that  is  efficient  on  the  ring.  In  this  case,  it  is  not 
a  priori  clear  that  the  algorithms  designed  specifically  for  the  grid  or  hypercube  will  outperform 
the  algorithm  mapped  from  the  ring.  However,  we  will  see  that  the  performances  are  comparable 
when  the  number  of  processors  is  small. 

2.  The  three  models  and  their  properties 
2.1.  Description  of  the  models 

Consider  the  linear  system  Ax  —  f,  where  A  is  a  real  N  x  N  matrix  whose  half-bandwidth 
is  v,  i.e.,  we  have  a,j  =  0,  Vi,  j  :  |i  -  j |  >  i/,  and  whose  total  bandwidth  is  2i/  -  1.  We  will  make 
the  important  assumption  that  A  is  such  that  no  pivoting  is  necessary  in  Gaussian  elimination. 
We  would  like  to  solve  the  above  linear  system  on  a  multiprocessor  consisting  an  ensemble  of  k 
identical  processors,  each  with  its  own  memory  sufficiently  large  to  hold  its  equal  share  of  the  data. 
The  processors  are  interconnected  according  to  one  of  the  following  three  simple  schemes: 

1.  The  first  architecture  consists  of  a  nearest  neighbor  interconnection  ring,  see  Figure  1.  The 
processors  are  numbered  consecutively  from  1  to  k.  We  assume  that  a  vector  of  length  m  can 
be  sent  from  one  processor  to  one  of  its  neighbors  via  the  local  link  in  time  equal  to 

0R  +  mr/t,  (2.1) 

where  Pr  is  a  constant  latency  time,  and  tr  is  the  elemental  transfer  time  of  the  local  links,  in 
seconds  per  word.  We  assume  that  any  processor  can  send  (or  receive)  a  data  item  from  one 
neighbor  while  sending  (or  receiving)  another  data  item  from  the  other  neighbor. 

2.  The  second  architecture  consists  of  a  two-dimensional  grid  of  k  processors  arranged  on  a  square 
grid  with  y/k  processors  on  each  side,  see  Figure  2.  For  convenience  we  assume  that  the 
processors  on  each  side  of  the  grid  are  connected  to  those  on  the  opposite  side.  We  assume 
these  wrap  around  connections  in  order  to  yield  more  homogeneous  complexity  results.  These 
connections  are  not  essential  for  the  algorithms  themselves.  We  will  often  use  the  term  2  -  D 
array  for  this  scheme.  Our  assumptions  are  identical  with  those  of  the  ring  model  but  one 
processor  is  now  able  to  simultaneously  communicate  with  four  neighbors.  We  assume  that 


the  time  required  to  send  a  vector  of  length  m  from  one  processor  to  one  of  its  neighbors  is 
modeled  by 

0a  +  mra,  (2.2) 

where  0c  is  the  grid  latency  time,  and  tg  is  the  grid  elemental  transfer  time. 

3.  Finally,  the  third  architecture  consists  of  a  hypercube,  or  n-cube  k  =  2n  identical  processors, 
numbered  from  0  to  2"  —  1.  The  interconnection  in  the  hypercube  is  such  that  there  is  a  link 
between  two  processors  if  and  only  if  the  binary  representations  of  their  numbers  differ  by  one 
and  only  one  bit.  Thus  any  node  has  exactly  n  neighbors.  For  example,  the  8  nodes  of  the  3- 
dixrrnsional  cube  (n  *  3)  can  be  represented  as  the  vertices  of  a  three  dimensional  hypercube, 
as  is  shown  in  Figure  3.  We  assume  that,  it  takes  the  time 

0H  +  rnrg ,  (2.3) 

to  transfer  a  vector  of  length  m  from  one  processor  to  any  number  of  its  ^neighbors.  Moreover, 
we  assume  as  before  that  communication  in  all  of  the  n  directions  can  take  place  simultane¬ 
ously. 

In  Ipsen,  Saad  and  Schultz  [3],  the  formula  (2.1)  was  used  for  estimating  communication  times 
in  various  Gaussian  elimination  implementations  for  solving  dense  linear  systems.  We  remark  that 
latency  times  may  be  much  larger  than  elemental  transfer  times. 

For  arithmetic  computations,  we  assume  that  for  all  architectures  a  sequence  of  m  pairs  of 
arithmetic  operations  involving  an  addition  and  a  multiplication  (such  as  the  inner  product  of  2 
m-vectors)  takes  the  time 


Tf  +  mu. 


(2.4) 
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Figure  S:  3-D  view  of  the  3-cube. 

in  any  of  the  k  processors.  Formula  (2.4)  allows  for  the  case  in  which  each  node  of  the  multiprocessor 
can  itself  be  a  pipelined  or  vector  machine,  in  which  a  sequence  of  operations  is  done  more  efficiently 
if  the  number  of  operations  is  large.  In  case  each  processor  is  not  a  pipelined  machine  we  have 
7  —  0. 

In  order  to  simplify  our  analysis  we  will  assume  throughout  the  paper  that  one  cannot  overlap 
arithmetic  with  communication  in  any  processor.  The  reason  for  the  non-overlapping  assumption  is 
essentially  pedagogical  and  can  be  justified  as  follows.  Models  where  arithmetic  and  communication 
can  be  overlapped,  are  more  complicated  to  analyze.  Yet  observe  that  the  execution  time  of  an 
algorithm  in  an  environment  where  overlapping  is  assumed  is  within  a  factor  of  only  two  of  that 
same  algorithm  run  in  an  environment  where  no  overlapping  is  assumed.  Indeed,  consider  any 
algorithm  and  let  t\  be  the  total  time  required  to  perform  its  arithmetic  and  tj  be  the  total  time 
required  to  perform  its  transfers.  In  the  overlapping  case,  the  total  execution  time  tg  will  be  at 
least  max{tji,  tj)  which  represents  the  time  with  maximum  overlapping.  In  the  non-overlapping 
case,  it  will  be  simply  t a  +  which  satisfies 

+  fy)  <  ma*{tA,fy}  <  tE  <  tA  +  tT. 

2.2.  Data  Transfer  Operations  in  Gaussian  Elimination 

A  particular  data  transfer  operation  which  is  essential  in  Gaussian  elimination  is  that  of  sending 
a  vector  of  m  elements  from  one  processor  to  all  the  others  or  to  a  few  neighboring  processors.  As 
was  pointed  out  in  [12,  3],  an  efficient  way  of  achieving  this  type  of  data  movement,  is  to  pipeline 
the  data  as  follows.  Assume  that  the  data  is  to  be  moved  from  Processor  Pi  to  the  sequence  of  t 
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consecutive  processors  Pi,I^..Pi+ i,  i.eM  through  a  path  of  length  t  and  let  us  split  the  data  into 
ft  packets  of  equal  size.  Then  in  step  1  the  first  packet  is  sent  from  Pi  to  P fe.  In  step  2,  while 
sending  the  first  packet  to  P3,  P%  receives  the  second  packet  from  Pi.  Generally,  at  step  j,  the  first 
packet  reaches  Pj+i  from  Pj  while  the  second  packet  follows  from  Pj-i  to  Pj,  etc.  The  first  packet 
reaches  P,+i  at  the  itk  step  and  ft  —  1  more  steps  are  needed  for  the  remaining  packets  to  reach 
P,+i,  resulting  in  a  total  of  (ft  -  1)  +  i  steps  and  a  time  of 

*(?)  -  (P  ~  1  +  »)(£  +  jr)  (2.5) 

where  0,r  stand  for  0r,tr  (ring)  0g,tg  (2-D  array  grid),  or  0b, tr  (hypercube).  We  remark  that 
a  ring  can  always  be  embedded  in  the  2-D  array  grid  and  hypercube.  The  above  time  is  minimized 
for  the  optimal  number  of  packets  equal  to 


ftopt(myi)  = 


l(i  ~  l)m— , 


and  the  corresponding  optimal  time  is 


topt(m,i)  =  (y/mr  +  \/(i  -  1)0)* . 


Note  that  since  we  must  have  1  <  ft  <  m,  formula  (2.6)  is  valid  only  when 

_ l _ <I<_2L 

m(»  -  1)  0  i  -  1 

This  will  always  be  the  case  if  m  is  large  enough  with  respect  with  the  ratio  t/0,  which  is  usually 
small.  However,  in  case  l/[m(i  -  \)\>r/0  then  the  optimal  number  of  packets  is  ft  «  1  and  the 
above  timing  formula  becomes  ^(m,t)  =  i(ror  +  0).  When  r/0>m/(i  —  1)  then  the  optimal 
number  of  packets  is  ft  —  m  and  the  optimal  time  becomes  ^(m,!*)  *  (m  + 1  -  l)(r  +  0). 

Formula  (2.7)  expresses  the  time  in  which  a  vector  of  size  to  can  be  moved  along  a  chosen  path 
of  length  i  between  two  processors.  This  may  not  be  the  best  possible  time  to  transfer  data  between 
two  processors  because  several  parallel  paths  (i.e.,  paths  that  do  not  cross  each  other)  can  be  used 
simultaneously  to  perform  the  data  transfer.  For  the  ring  there  are  two  parallel  paths  between  any 
two  processors,  while  for  the  2-D  grid  there  may  exist  up  to  four  such  paths.  For  the  hypercube  it 
can  be  shown  that  n  parallel  paths  exist  between  any  two  given  nodes  [8]. 

Suppose  that  we  want  to  transfer  a  data  block  of  size  m  from  some  processor  to  all  the 
processors  in  a  ring.  An  efficient  way  of  doing  this  operation  is  to  pipeline  the  data  in  both 
directions  around  the  ring  from  the  initial  processor.  Then  to  reach  every  processor,  the  data  must 
travel  across  only  i  =  \(k  -  l)/2]  processors  in  each  of  the  two  directions  of  travel.  Therefore, 
broadcasting  a  vector  of  length  to  in  a  ring  requires  the  time: 


R)  m  •  (2-8) 

Consider  now  the  same  operation  of  transferring  a  vector  of  length  m  from  one  processor  to 
all  others  in  an  n-cube.  At  first,  let  us  describe  a  simple  algorithm  for  sending  one  element  from 
one  processor  to  all  the  other  processors  of  the  cube.  The  algorithm  is  based  on  the  fact  that  any 
n-cube  is  the  union  of  two  (n-  l)-cubes  whose  nodes  are  connected  in  a  one-to-one  fashion.  In  fact 
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this  constitutes  a  simple  recursive  definition  of  n-cubes.  Specifically,  all  the  nodes  whose  binary 
number  is  of  the  form  Qa  where  a  is  any  ( n  —  l)-bit  binary  number  representing  the  vertices  of  an 
(n-  l)-cube  are  linked  to  the  nodes  whose  numbers  are  la.  This  suggest  the  following  algorithm  for 
sending  one  element  from  node  0  to  all  other  nodes  in  an  n-cube.  The  powers  refer  to  concatenation. 


ALGORITHM:  Hypercnbe  Broadcast 

(1)  Start:  Send  data  item  from  processor  0"  to  processor  0n-1l. 

(2)  Loop:  For  j  =  2,3, ...  n  do 

All  processors  numbered  0"-J,+1ay,  where  ay  is  any  (j  -  l)-bit  binary  number  move  data 
item  in  parallel  to  processor  0"~ilay. 


Since  there  are  n  steps  in  the  above  algorithm,  any  given  element  can  be  transferred  to  all 
others  in  an  n-cube  in  a  time  n(0g  +  :#).  When  sending  a  vector  of  length  m  it  is  clear  that  the 
above  idea  of  pipelining  the  data  transfer  can  still  be  used.  Since  this  will  amount  to  pipelining 
the  data  transfer  in  a  path  of  length  n,  it  is  clear  that  formula  (2.7)  applies,  i.e.,  the  optimal  time 
for  sending  a  vector  of  length  m  from  one  processor  to  all  others  in  an  n-cube  is  given  by: 

tBrmodc(m,H)  m  (ymTH  +  V'(n-l)^)2.  (2.9) 


3.  Algorithms  for  the  ring  architecture. 

The  algorithms  we  develop  in  this  section  are  valid  only  for  the  case  when  the  number  of 
processors  is  smaller  than  the  bandwidth  of  the  problem  (k  <  v).  We  assume  a  ring  interconnection. 
However,  it  is  clear  that  since  a  ring  can  be  mapped  into  many  other  architectures  these  algorithms 
have  a  fairly  wide  range  of  application  whenever  k  <  v. 

j  3.1.  Interleaved  Gaussian  Elimination 

In  [3]  several  modified  Gaussian  elimination  algorithms  for  solving  dense  linear  systems  have 
been  proposed.  One  of  the  more  efficient  algorithms,  named  Row-Scattered  Gaussian  elimination, 
consists  in  interleaving  the  system  across  the  processors  and  performing  a  slight  variation  of  the 
usual  Gaussian  algorithm. 

In  order  to  adapt  this  interleaved  Gaussian  elimination  algorithm  to  banded  linear  systems, 
we  distribute  the  equations  of  Ax  =  f  among  the  processors  in  the  way  shown  in  Figure  4.  The  t‘* 
processor  contains  the  N/k  equations  *,  i  +  k,i  +  2k, ...,  »  +  N  —  k,  where  we  assume  for  convenience 
that  N  is  a  multiple  ox  k. 

Typically,  at  the  ith  step  of  Gaussian  elimination,  the  pivoting  row  (of  length  v)  is  first  sent  to 
I  all  processors  and  then  each  processor  eliminates  those  rows  among  the  rows  »+l,  i+2,  ..i+v- 1  that 

it  holds.  There  are  at  most  f  such  rows  in  each  processor.  In  order  to  keep  the  processors  busy  as 
much  as  possible  and  to  obtain  k  x  k  diagonal  blocks  that  have  the  nice  property  of  being  diagonal 
matrices,  we  employ  the  modification  of  the  classical  Gaussian  elimination  suggested  in  [3],  i.e., 
eliminate  the  i-th  variable  not  only  rows  *  + 1,  i+ 2, ..«'  +  v  -  1  but  also  in  rows  -  1,  *  -  2, ..,  q,  where 
q  =  This  will  result  in  a  matrix  that  is  upper  triangular,  banded,  and  with  diagonal  k  x  k 
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diagonal  blocks  (referred  to  as  DDB  matrices) .  The  resulting  algorithm  is  referred  to  as  Algorithm 
RIBGE  (Row-Interleaved  Banded  Gaussian  Elimination)  and  is  formally  described  below. 


ALGORITHM  RIBGE  (Row-Interleaved  Banded  Gaussian  Elimination) 

For  q  as  1,2,-.-  iV/Ar  do: 

For  i  =s  1,2,... A:  do: 

(1)  Send  pivot  row:  Send  row  number  (9  —  l)/r  +  t  from  processor  t  to  all  other  processors. 

(2)  Perform  eliminations: 

In  each  processor  Pj,j  =  1,2,  ..Jt  do  in  parallel: 

For  h=  (q  -  l)i  +  j.  Step  k,  while  {h  <  i  -f  v  and  j  #  »}  Do: 


ah,»  '• —  ah,»  ~  'thjaqk+i,*i 


where  7aj  —  ah,(9-i)t+>/a(»-i)t+i,(?-i)t+i- 


Let  us  estimate  the  time  to  perform  the  Gaussian  elimination  process.  For  each  step  t  = 
1, 2,  .  JV  —  1,  we  need 

•  to  transfer  a  row  of  size  v  to  all  processors  at  a  cost  of  (2.8): 


•  perform  eliminations  at  the  total  cost  of  +  71- 

Summing  up  the  above  times  for  i  =  1,2,. .N  -  1  results  in  the  total  of  approximately 


tr,R  «  N 


(3.1) 


for  communication  and 

tA  fa  l(*/u;  +  7)  (3-2) 

for  arithmetic.  These  results  will  now  be  recast  in  a  form  that  will  be  convenient  for  subsequent 
comparisons. 

Proposition  S.l.  The  total  time  required  for  performing  the  Row-Interleaved  Banded  Gaussian 
elimination  on  a  k -processor  ring  is  approximately : 


Iribge  »  1(*/w  +  7)  +  Nvtr  (1  +  KOr*)2 


(3.3) 


=  N 


(^1(^  +  7) 


]  +JV 


VTR  +  \j2kv0RTR  +  ^k/3R 
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Figure  4:  Row-wise  interleaving  of  a  banded  linear  system  in 
a  4  processor  system. 


where  K&y/k  and  qr  =  \J 


The  approximation  (k~l)/k  «  1  has  been  made  to  derive  (3.1)  and  (3.2).  The  above  estimates 
indicate  that  for  large  enough  bandwidth  the  time  for  arithmetic  is  reduced  by  a  factor  of  k ,  as 
compared  with  the  time  on  a  single  processor.  When  v  <  Jfc,  then  we  cannot  further  speed-up  the 
arithmetic  with  the  above  algorithm,  as  is  indicated  by  formula  (3.3).  Clearly,  we  will  then  be 
doing  better  with  k  =  u  processors  because  of  the  increase  of  communication  time. 

Thus  using  an  ever  larger  number  of  processors  will  not  always  speed-up  the  algorithm  i.e., 
there  is  an  upper  limit  beyond  which  it  is  ineffective  to  use  more  processors.  This  suggests  that 
there  is  an  optimal  number  of  processors  which  can  be  obtained  by  trying  to  minimize  the  timing 
formula  (3.3)  with  respect  to  the  parameter  k.  Unfortunately,  this  function  of  k  is  complicated  and 
its  exact  minimum  even  if  available  might  be  difficult  to  exploit  and  interpret.  In  order  to  simplify 
the  expression  we  first  replace  the  term  \(v  —  l)/Ar"|  by  the  approximation  v/k.  We  call  Iribge  the 
resulting  right  hand  side  of  (3.3).  A  second  simplification  is  realized  by  using  an  upper  and  lower 
bound  for  the  term  corresponding  to  communication  time,  namely 


k 


:0R  <  ^y/vTR  +  0*^  <  2VTR  + 


_ i 


v.-.y 
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Note  that  a  similar  expression  has  been  used  in  [3].  The  upper  bound  simply  corresponds  to  non- 
optimal  pipeling  of  data  transfers,  whereby  one  splits  the  data  into  |  packets  instead  of  the  optimal 
Hopt  packets  as  given  by  (2.6). 

Thus  the  time  Iribge  is  bounded  above  by 


tRIBGE  £  f(*)  =  N 


-{vuj  +  l)  +  2  {urR  +  -0r) 


(3.5) 


By  differentiating  the  above  function  with  respect  to  k  and  equating  the  result  to  zero,  we  find  that 
the  minimum  of  the  above  upper-bound  is  achieved  at 


(3.6) 


Substituting  k*  in  (3.5)  we  get  the  optimal  time 

t(k*)  =  2Nv  [\/("  +  l/v)PR  +  tr\  • 

Denoting  by  kopt  the  value  of  k  for  which  the  minimum  of  Iribge  is  achieved,  we  have 


(3.7) 


i»in  Irjbge  —  Iribge  {kopt)  <  ^/bge(^*)  <  ?(**)• 


(3.8) 


A  similar  argument  can  be  used  for  the  lower  bound 


Iribge  >  t{k)  s  N 
the  minimum  of  which  is  achieved  for 


£(t/w  +  if)  +  [istr  +  ^fi)J  , 


(3.9) 


(3.10) 


and  has  the  value 


f(Jt.)  =  2  Nu 


}J\{v  +  iM0R+\tR 


Moreover, 


“in  Iribge  —  Irjbge  {kopt)  >  li^opt)  £  !(*•)• 


(3.11) 

(3.12) 


Grouping  the  two  bounds  (3.8)  and  (3.12)  we  get  the  following  result. 


Proposition  3.2.  The  approximate  minimum  time  in  which  Algorithm  RIBGE  can  be  performed 
on  a  multiprocessor  ring  with  an  arbitrary  number  of  processors  is  such  that 


2  Nv 


\J\{v  +  l/'/)0R  +  <  nun  Iribge  <2Nv  [ y/iu  +  7/ v)Pr  +  tr J .  (3.13) 


This  result  indicates  that,  for  large  values  of  v,  the  optimal  time  is  nearly  linear  in  v. 
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Figure  5:  DDB  banded  upper  triangular  system  resulting 
from  interleaved  Gaussian  elimination  in  a  4-processor  sys¬ 
tem. 

3.2.  Solution  of  the  Resulting  Triangular  Systems 

The  upper  triangular  systems  resulting  from  the  Gaussian  elimination  algorithm  described  in 
Section  3.1  have  the  form  depicted  in  Figure  5.  Each  diagonal  block  is  a  k  x  k  diagonal  matrix, 
while  the  off-diagonal  blocks  Bj  are  dense.  The  solution  algorithm  for  such  a  system  can  be  adapted 
from  [3]  in  a  straightforward  way.  The  algorithm  starts  by  solving  the  bottom  right  k  x  k  block 
system.  Note  that  since  the  diagonal  blocks  are  diagonal  matrices,  this  amounts  to  one  single 
arithmetic  division  in  each  processor  with  each  processor  winding  up  with  one  component  of  xm, 
the  bottom  ^-dimensional  block  of  the  solution.  In  a  general  step  j,  we  first  need  to  deliver  those 
just  computed  k  components  to  each  processor.  This  can  be  achieved  by  rotating  the  data  i,+j 
one  element  at  a  time  around  the  ring,  and  requires  a  total  time  of 

k{0R  +  ~r)- 


»V-Y- 


Then  we  perform  the  vector  operation: 


bj  bj  “  B  j  x  j 


(3.14) 


where  ij  is  the  vector  consisting  of  the  v  components  following  those  of  the  jtk  block  Xj.  while 
Bj  is  the  rectangular  matrix  of  size  k  x  v  in  block  position  j,j  +  1-  The  vector  operation  (3.14) 
consists  in  k  independent  vector  operations  each  of  length  v.  Neglecting  the  cost  of  solving  the 
diagonal  systems,  the  arithmetic  time  for  a  typical  step  is  therefore 

1  +  vu. 

Summing  the  above  times  over  the  m  =  N/k  steps,  we  get  that  the  total  time  for  solving  the  system 
comes  to 

It,r  —  N{0r  +  tr)  (3.15) 

for  communication  and 

tA  =  +  7),  (3.16) 

for  arithmetic.  Observe  that  neither  time  increases  as  the  number  of  processors  k  increases.  Thus, 
it  always  pays  to  us  as  many  processors  as  possible.  Note,  however,  that  for  large  v  the  total  time 
for  solving  the  triangular  system  is  small  as  compared  with  the  time  for  Gaussian  elimination. 

4.  The  block  interleaving  algorithm  for  grid  arrays 

For  convenience  we  define  k  =  Jk  and  consider  the  k  X  k  grid  of  processors  of  Figure  2.  We 
number  the  processors  with  two  indices  as  they  would  be  numbered  in  a  matrix:  processor  PiJ  is 
the  processor  located  at  the  intersection  of  the  tth  row  of  the  grid  (starting  from  top)  and  the  jth 
column  (starting  from  left).  We  partition  A  into  square  sub- matrices  Aij  of  size  £  x  ~  and  for 
simplicity  we  assume  that  A  is  a  block-banded  matrix  of  block-bandwidth  k  i.e.,  we  have  A*j  =  0 
if  |»  —j\  >  k.  Note  that  for  a  general  banded  matrix  of  half-bandwidth  v  this  assumption  is  always 
satisfied  by  redefining  v  to  be  vntw  ~  k\v/{k  —  1)]. 

We  proceed  in  two  steps  to  assign  the  matrix  to  the  processors: 

1.  We  partition  A  into  square  sub-blocks  of  size  v  x  u. 

2.  Each  v  x  v  sub-block  is  itself  partitioned  into  square  blocks  of  size  |  x  ~  each.  This  partitioning 
is  then  mapped  naturally  onto  the  grid  of  processors,  i.e.,  the  ( i,j )  sub-block  of  each  i/xi/ 
block  is  assigned  to  the  processor  numbered  (t,j).  Clearly  we  need  only  store  the  nonzero 
sub-blocks. 

In  other  words,  relative  to  the  partitioning  of  A  into  sub-matrices  of  size  £  x  £,  the  submatrix 
Aij  belongs  to  processor  Pn(i),no-)>  where 

n(»)  s  1  +  Mod(t,«)  and  II(j)  =  1  +  Mod(j,K).  (4.1) 

Moreover,  the  element  0,7  of  A  belongs  to  processor  P*(» where 

*(0  =  "(r^i)  «L<1  ,y)  =  (4.2) 

This  is  illustrated  for  k  =  4  in  Figure  6.  With  this  scattering  of  the  data  each  row  of  the  matrix 
is  distributed  among  the  k  nodes  of  one  row  of  the  grid  and  each  column  is  also  distributed  among 
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3,2  3,3 


4,3 


2.2  2,3  2,4  2,1 

3.2  3,3  3,4  3,1  3,3 

4.2  4,3  4,4  4,1  4,3  4,3 

1.2  1,3  1,4  1,1  1,3  1,3  1,4 

2,3  2,4  2,1  2,2  2,3  2,4 

3,4  3,1  3,2  3,3  3,4 

4,1  4,2  4,3  4,4 


Figure  6:  Block  interleaving  of  a  banded  system  in  a  4  x  4 
processor  grid. 


the  nodes  of  one  column  of  the  grid.  In  order  to  avoid  confusion  between  the  rows  and  columns 
of  the  matrix  and  those  of  the  multiprocessor  grid,  we  will  refer  to  the  rows  (resp.,  columns)  of 
the  grid  as  the  grid-rows  (resp.,  grid-columns).  With  this  terminology,  the  Gaussian  Elimination 
algorithm  can  naturally  be  implemented  as  follows: 


_  •  /•  ..  ’j. •  *  "  •  *  •  •  •  V 


ALGORITHM  BIGGE  (Banded  Interleaved  Grid  Gaussian  Elimination) 

For  i  =  1, 2, . . .  N  -  1  do: 

(1)  Normalize  row:  Send  a,-,,-  to  all  processors  of  the  same  grid- row  and  divide  the  ith  row  by 
a„,  in  each  of  these  processors. 

(2)  Broadcast  normalized  pivot  row  vertically:  All  processors  containing  part  of  row  »,  i.e., 
processors  (ir(t'),*)  send  their  part  of  the  pivot  row  to  all  processors  (h,*),k  =  1,2 ,..n,h  ^ 
ir(t)  of  the  same  grid-column. 

(3)  Broadcast  column  of  multipliers  horizontally:  Processors  containing  part  of  column  »,  i.e., 
processors  (*,  jr(t))  send  their  part  of  the  column  of  multipliers  to  all  processors  (*,/),/  = 
1,2 ,..k,1  ^  jt(«)  of  the  same  grid-row. 

(4)  Perform  eliminations:  Using  the  multipliers  and  the  pivot  row,  each  processor  performs  its 
part  of  the  ith  step  of  Gaussian  elimination. 


The  following  result  estimates  the  time  of  executing  algorithm  BIGGE. 

Proposition  4.1.  If  k  <v,  the  total  time  for  performing  Algorithm  BIGGE  is  approximately 


t BIGGE  «  N  W  +  +  N^tg  [1  +  »cac]2  (4.3) 

*  N  [(k)2"  +  k7]  +  N  kTc  +  ^2uTg^g  +  GK ]  ’ 


where  qq  s 


Proof.  First,  we  discuss  the  time  for  data  communication.  We  will  neglect  the  lower  order  cost 
of  communicating  the  element  a,-,-  to  processors  (jt(i),*)  in  step  (1).  Steps  (2)  and  (3)  can  be 
overlapped  according  to  our  assumptions  (if  not  just  double  the  communication  time  in  the  above 
result).  Hence,  steps  (2)  and  (3)  are  essentially  two  overlapped  broadcast  operations,  one  in  a 
vertical  ring  of  #e  processors  and  the  other  in  a  vertical  ring  of  k  processors.  By  virtue  of  formula 
(2.8),  both  steps  (2)  and  (3)  can  therefore  be  accomplished  in  a  total  time  of 


<r(«)  * 


(4.4) 


Summing  the  above  times  over  the  iV  —  1  steps,  we  get  the  total  time  for  data  communication. 
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Second,  we  discuss  the  time  for  arithmetic.  Again  we  can  neglect  the  cost  of  arithmetic 
operations  of  step  (1)  as  they  are  of  lower  order.  In  the  elimination  step  (4)  each  processor  performs 
~  consecutive  operations  of  the  form  row  :=  row  -  scalar  *  row ,  where  row  is  a  row  of  length 
Hence,  according  to  formula  (2.4)  the  total  arithmetic  time  for  step  i  is 


*a(0  *  + 

ft 


and  for  N  —  1  steps  this  comes  to  approximately: 


+r>  • 


Adding  the  communication  and  the  arithmetic  times  we  get  the  total  of  (4.3). 


A  disappointing,  but  expected  result,  is  that  when  the  number  of  processors  gets  sufficiently 
large,  the  total  execution  time  increases  due  to  the  larger  number  of  start-ups  in  communication 
as  is  reflected  by  the  last  term  of  (4.3) . 

This  raises  the  question  of  the  optimal  nunber  of  processors.  Given  an  arbitrarily  large  number 
of  processors  (up  to  kmmx  —  </3)  is  it  best  to  use  all  of  the  available  processors  or  to  “turn-off”  a  few 
of  them  and  use  only  part  of  the  available  computing  power?  Although  this  seems  counterintuitive, 
we  show  that  in  general,  fewer  than  i/2  processors  must  be  used  in  order  to  get  the  maximum 
speed-up.  For  this  purpose  let  us  expand  the  total  time  (4.3)  in  terms  of  the  variable  x  s  We 
obtain 

tBiGGE  «  t{x)  s  N  x2u  +  (7  +  rG)x  +  +  N sJlvTdPa-  (4.5) 

Note  that  the  last  term  is  constant  with  respect  to  the  variable  x.  We  would  like  to  minimize 
the  expression  in  (4.5)  with  respect  to  the  real  variable  x.  However,  an  attempt  to  do  so  would 
immediately  lead  to  a  third  degree  equation  which  would  be  difficult  to  solve.  Fortunately,  we 
can  derive  an  approximate  solution  which  will  be  shown  to  be  nearly  optimal.  The  near  optimal 
solution  is  obtained  by  discarding  the  first  degree  term  in  x  in  (4.5),  i.e.,  we  seek  the  minimum  of 
the  following  lower  bound  of  t(x): 


l(x)  =  ,N  x2u+  +  N \J2wg0g- 

2  x 


Differentiating  with  respect  to  x,  it  is  found  that  the  minimum  of  l{x)  is  readied  for 


-e&r 


Substituting  in  (4.6)  we  get 


*(*)  >  U  s  i(*.)  =  3ATW1'3  /S  +  N\/2vtgPg  V*,  U  =  3 Nux*  +  Ny/ton^fc.  (4.8) 

Note  that  the  restriction  1  <  k  <  v2  translates  into  the  restriction  \  <  x  <.  v. 

The  next  proposition  establishes  that  for  large  1/,  the  time  t*  is  nearly  equal  to  the  minimum 


Proposition  4.2.  Let  topi  be  the  minimum  of  t(z),  x  >  0,  and  let  t.  be  defined  by  (4.8).  Then 


0  <  topi  -  t ,  <  4  1  +  tq 

~  t.  ~  3  (4u)»lZplJ*v\l* 

Proof.  By  definition 

top!  <  t(x.)  =  l(x»)  +  N(i  +  tg)x,  s  t,  +  N{-i  +  rG)x,. 
If  Xopt  is  the  value  of  x  for  which  t(x)  is  minimum,  we  also  have: 


(4.9) 

(4.10) 


This,  together  with  (4.12),  gives  (4.9). 

■ 

A  similar  argument  was  used  in  [7]  to  analyse  the  communication  complexity  of  the  Gaussian 
elimination  algorithm. 

The  proposition  asserts  that  the  minimum  time  for  performing  Algorithm  BIGGE,  is  of  the 
form  0(Nv2/3),  i.e.  the  best  speed*up  is  of  the  order  of  t/4f3,  instead  of  0(i/2)  as  might  have 
been  expected.  The  reason  why  the  best  possible  speed-up  cannot  be  achieved  is  due  to  the  term 
that  increases  with  k  in  t bigge i  i.e.,  to  communication  start-up  times.  If  the  successive  steps  of 
Gaussian  elimination  are  overlapped  then  it  is  possible  to  reduce  the  effect  of  start-up.  Thus,  for 
a  large  number  of  processors,  better  algorithms  can  be  found  as  will  be  seen  in  Section  6. 


Figure  7:  3-D  view  of  the  4-cube. 


5.  Gaussian  elimination  for  the  hyperenbe  multiprocessor 

The  hypercube  [13]  is  a  popular  ensemble  architecture.  The  n— cube  multiprocessor  consists 
of  2"  nodes  that  are  numbered  by  n-bit  binary  numbers,  from  0  to  2"  -  1.  The  interconnection  is 
such  that  there  is  a  direct  link  between  two  processors  if  and  only  if  their  binary  representations 
differ  by  one  and  only  one  bit.  In  the  case  n  =  3,  the  8  nodes  can  be  represented  as  the  vertices 
of  a  three-dimensional  cube,  see  Figure  3.  When  n  m  4  the  three-dimensional  representation 
shown  in  Figure  7  is  quite  useful.  In  Figure  7,  a  4-cube  is  obtained  by  joining  all  nodes  of  an 
inner-3-cube  with  their  geometrical  counterparts  of  an  outer-3-cube.  Some  general  results  on  the 
topology  of  hypercubes  are  discussed  in  [8]  while  reference  [9]  examines  communication  problems 
in  a  hypercube  multiprocessor. 


6.1.  The  Hypercube  Banded  Gaussian  Elimination 

For  this  section  we  assume  that  n  is  even.  The  number  of  processors,  denoted  by  k,  is  of 
the  form  k  =  2"  and,  as  in  Section  4,  we  define  k  =  y/k  =  2n/2.  In  order  to  be  able  to  derive  an 
analogue  of  Algorithm  BIGGE  for  the  hypercube,  we  first  map  a  kxk  grid  into  the  n-cube  and  then 
map  the  algorithm  onto  the  defined  grid.  However,  we  will  try  to  exploit  the  more  advantageous 
interconnection  features  of  the  hypercube,  when  moving  data  in  algorithm  BIGGE. 

To  map  2-D  a  grid  into  n-cubes,  we  make  the  important  observation  that  an  n-cube  can  be 
viewed  as  a  “cross-product”  of  two  ^ -cubes.  Indeed,  we  can  consider  the  n-bit  binary  label  of 
any  n-cube  node  as  the  result  of  concatenating  two  j-bit  binary  numbers,  say  b,  and  cy.  In  other 
words,  we  can  write  any  node  number  as  ay  =  6,-  A  cy,  where  A  denotes  the  concatenation,  and 
6,  ,  Cj  axe  the  first  and  second  £  bit®  of  the  node  number.  From  the  properties  of  the  n-cube  it  can 
be  easily  seen  that  when  bj  is  fixed  the  resulting  2?  nodes  obtained  by  varying  the  second  part  of 
the  binary  number,  i.e.,  by  varying  c,-,  form  an  lj-cube,  i.e.,  a  sub-cube  of  the  n-cube.  Similarly, 
when  we  fix  the  second  part  c,-  and  let  by  vary,  we  obtain  another  ^ -subcube. 

This  defines  in  a  natural  way  the  n-cube  as  a  cross-product  of  the  two  ^-cubes.  We  refer  to 
a  vertical  plane  as  an  f-cube  defined  as  the  set  of  all  o,y  where  j  is  fixed.  A  horizontal  plane  is 
defined  likewise  by  fixing  »  and  letting  j  vary.  This  is  illustrated  in  Figure  7  where  the  4  horizontal 
planes  are  numbered  with  one  of  the  four  binary  numbers  00,  01, 11, 10  (in  this  order)  from  bottom 
to  top.  Then  the  vertical  planes  defined  by  the  4  vertical  trapezes  are  in  turn  successively  numbered 
00,  01,  11,  10.  Each  node  of  the  4-cube  is  the  only  intersection  point  of  a  horizontal  plane  and  a 
vertical  plane.  Note  that  this  numbering  is  by  no  means  unique. 

Let  us  now  assume,  as  was  done  in  Section  4,  that  the  matrix  A  is  a  block-banded  matrix 
of  block-bandwidth  k,  where  each  block  is  of  size  u/k ,  i.e.,  we  have  Aij  —  0  if  )s  —  j\  >  k.  If 
we  partition  A  into  sub-matrices  of  size  *  x  £,  then  the  submatrix  A,j  is  assigned  to  the  node 
numbered  A(i)  Ah(j),  where  h(i)  =  B»nary(Mod(»,*c)j.  The  distribution  of  Figure  6  is  still  valid  but 
the  processors  on  the  same  block-column,  now  form  an  §-cube  while  those  of  the  same  block-row 
form  another  f-cube. 

The  only  difference  with  the  grid  algorithm  is  that  communication  is  faster  because  of  the 
richer  interconnections  of  the  subcubes.  In  what  follows  we  will  make  the  assumption  that  a  node 
can  be  crossed  simultaneously  by  two  streams  of  data,  one  going  vertically,  i.e.,  traveling  in  the 
vertical  plane  of  the  node,  while  the  other  travels  in  the  horizontal  plane  of  the  node.  It  is  not 
necesseary  to  rewrite  the  algorithm  as  it  is  identical  with  Algorithm  BIGGE.  We  refer  to  the  new 
algorithm  as  the  Hypercube  Banded  Gaussian  Elimination  (HBGE). 

We  now  estimate  the  time  required  to  perform  Algorithm  HBGE.  The  time  for  performing 
arithmetic  operations  is  identical  with  that  of  BIGGE,  i.e.,  it  is  given  by 

Umbge  -  N  |(£)  «  +  .  (5.1) 

In  order  to  estimate  the  time  required  for  the  communication  tasks,  all  we  need  to  know  is 
the  time  for  moving  a  vector  of  length  m  from  one  processor  to  all  others  in  an  $-cube,  as  this 
is  the  only  important  transfer  operation  of  the  algorithm,  which  takes  place  in  steps  (2)  and  (3) 
(simultaneously).  This  was  examined  in  detail  in  Section  2.2,  and  as  a  result  we  infer  the  following 
Proposition. 


Proposition  5.1.  The  total  time  required  to  perform  Algorithm  HBGE  on  a  hypercube  of  k  =  k2 
processors  where  k  is  a  power  of  2,  is  approximately 


tBBGE^N  +  +  N^tb  (l  +  oiB\/2idog%Kj 


(5.2) 


=  N 


+  N 


V-tb  +  2{»tb0b1^ 


+  PslogiK 


where  as  — 


Note  that  a#  has  been  chosen  of  this  form  in  order  to  be  consistent  with 


previous  remits. 


6.  Pipelining  and  the  wavefront  approach  for  the  grid  architecture 

As  was  observed  in  the  previous  sections,  when  the  number  of  processors  is  large  compared 
to  the  half-bandwidth  v ,  the  loss  of  efficiency  appears  to  be  nonnegligible  due  to  communication 
overhead.  Thus,  while  one  expected  a  speed-up  of  the  order  of  v2  using  the  maximum  allowable 
number  of  processors,  k  —  u2,  we  were  only  able  to  achieve  a  speed-up  of  the  order  of  v*/s.  The 
reason  for  this  inefficiency  is  that  those  algorithms  do  not  pipeline  the  successive  steps  of  Gaussian 
elimination.  When  a  column  or  a  row  of  information  is  broadcast,  some  processors  are  idle  waiting 
for  the  information  to  reach  them  and  since  the  number  of  processors  is  large  the  information  will 
travel  a  long  way  across  processors  leading  to  important  idle  times.  It  is  shown  in  this  section  that 
it  is  vital  in  this  situation  to  resort  to  pipelining,  i.e.,  to  employ  an  algorithm  of  finer  granularity 
and  overlap  the  successive  steps  of  the  elimination  as  much  as  possible.  There  are  a  variety  of  ways 
of  achieving  this  for  the  Gaussian  elimination  algorithm. 

A  systematic  way  of  pipelining  in  Gaussian  elimination  is  the  so-called  wavefront  approach, 
which  is  well-known  in  the  context  of  algorithms  for  systolic  architectures  [6].  We  describe  here 
an  adaptation  of  this  approach  for  solving  banded  linear  systems.  The  algorithm  can  be  viewed  as 
a  simple  implementation  of  the  banded  LU  factorization  described  in  pages  280-285  of  [6],  which 
seems  to  have  been  known  for  a  long  time  [10]. 

Put  in  very  simple  terms,  the  wavefront  approach  consists  in  performing  the  elimination  by 
skew-diagonals,  a  skew-diagonal  of  A  being  the  set  of  elements  a,;  such  that  i  +  j  is  equal  to  some 
constant.  In  other  words,  all  computations  relative  to  an  elimination  step  of  Gauss’  algorithm  are 
performed  simultaneously  on  the  elements  of  the  matrix  that  are  on  the  same  skew-diagonal.  The 
basic  idea  of  the  wavefront  approach  is  that  we  can  overlap  several  consecutive  such  computations 
which  are  referred  to  as  waves.  To  describe  the  algorithm  in  its  simplest  form,  we  initially  assume 
that  k  =  i/2,  i.e.,  each  box  of  Figure  6,  represents  a  processor  which  contains  one  nonzero  element 
of  A.  For  the  following  discussion,  we  refer  to  both  Figure  6  for  the  matrix  assignment  and  Figure 
8  for  an  illustration  of  the  successive  steps.  A  general  step  will  consist  of  a  communication  task 
and  a  computation  task.  In  the  very  first  step  of  the  algorithm,  processor  (1, 1)  sends  the  element 
<*i,i  downward  to  processor  (2,1),  which  then  uses  this  information  to  compute  the  multiplier 
12,1  —  <*2,1/014.  In  the  second  step,  this  multiplier  is  now  transferred  to  the  processor  (2,2)  to 
the  right.  Simultaneously,  processor  (2, 1)  passes  on  the  element  014  that  it  has  received  in  step 
1,  downward  to  processor  (3,1),  while  processor  (1,2)  transmits  the  element  01,2  downward  to 
processor  (2,2).  After  these  data  transfers,  processors  (1,3),  (2,2),  and  (3,1)  perform  in  parallel 
the  elimination  work  corresponding  to  the  first  row  of  the  matrix,  on  the  elements  01,3,02,2  and 
034.  Thus,  processor  (1,3)  is  actually  idle  (n:>  work  is  done  on  the  first  row  which  is  the  pivot 
row),  processor  (2,2)  computes  the  new  element  a\  2  resulting  from  the  elimination  of  row  1,  using 


the  data  <12,2,  01,2,  and  / 2,1  and  processor  (3,1)  computes  the  multiplier  h,i  —  03,1/01,1.  One  can 
distinguish  the  formation  of  a  first  wave  of  computations. 

In  the  next  step  (step  3)  this  wave  propagates  to  the  right  by  transferring  the  multipliers 
/j,i,t  =  2,3  to  the  eastern  neighbors  and  the  first  row  elements  downward  to  the  southern  neighbors. 
The  elimination  corresponding  to  the  first  row  is  then  performed  on  the  fourth  skew  diagonal  and 
the  new  multiplier  I*,  1  =  0.4,1/01,1  is  formed  in  processor  (4,1).  A  general  step  consists  of  trans¬ 
ferring  the  multipliers  l to  the  right  and  the  pivot  row  elements  downward,  and  then  performing 
arithmetic,  i.e.,  computing  the  reduced  elements.  At  the  fourth  step,  the  second  wave  correspond¬ 
ing  to  the  elimination  by  the  second  row  can  be  started.  Indeed,  the  last  task  of  the  elimination  of 
row  1  has  just  been  completed  on  the  elements  02,2  and  02,3  and  therefore  the  multiplier  {3,2  can 
be  computed  in  processor  (3,2).  Here,  we  need  to  send  not  only  the  elements  of  the  first  wave  to 
the  right  but  also  the  element  01,2  downward  from  processor  (2,2)  to  processor  (3,2). 

It  is  easy  to  deduce  the  following  steps.  The  first  8  steps  of  the  algorithm  axe  illustrated  in 
Figure  8  for  the  case  v  —  4.  A  new  wave  appears  every  three  steps  and  dies  off  after  a  total  of 
2(i/  -  1)  steps.  If  we  refer  to  a  front  as  the  set  of  the  concurrently  active  waves,  the  wavefront 
consists  of  a  total  of  v  -  2  waves,  dealing  with  v  -  2  consecutive  pivot  rows,  at  the  high  regime  of 
the  pipelining,  i.e.,  after  the  first  2v  —  2  steps  and  before  the  last  u  —  1  steps. 

New  waves  appear. in  steps  l,4,7,...3(t  —  1)  +  1,....  The  wave  carrying  the  information  for 
the  elimination  by  row  1,  i.e.  for  forming  the  matrix  A^,  appears  in  step  3(t  - 1)  + 1.  In  particular, 
for  t  =  N,  we  would  be  starting  a  fictious  wave  corresponding  to  the  elimination  of  row  N,  while 
finishing  the  work  on  element  a#j y  by  the  previous  wave.  In  other  words  the  algorithm  stops  in  step 
Z{N  -  1)  + 1  =  3 N  —  2.  A  comparison  with  previous  work  on  (dense)  wavefront  factorization  reveals 
that  the  method  requires  exactly  the  same  number  of  steps  [10].  Therefore,  the  only  difference  with 
the  dense  case  is  that  we  are  using  a  smaller  number  of  processors,  because  of  the  bandedness  of 
the  matrix. 

With  the  assignment  of  Figure  6,  the  active  processors  are  performing  the  same  amount  of  work 
at  any  given  step.  Typically,  in  the  same  wave,  some  processors  will  be  performing  an  elimination 
of  the  form 

«*/’  -  «?,  -  '»«{,  (61) 

while  (at  most  one)  will  compute  a  new  multiplier  1;*.  Looking  at  communication  patterns  each 
step  requires  simultaneous  nearest  neighbor  communication  whereby  any  processor  of  the  same 
wave  sends  multipliers  eastward  and  pivot-row  elements  southward.  These  data  transfers  can  be 
overlapped.  As  a  result,  each  step  costs  a  total  of  7+w  for  arithmetic  and  $g+tg  for  communication. 
Therefore,  the  wavefront  algorithm  for  factoring  a  banded  matrix  requires  a  total  of 

(3 N  —  2)  +  u>  +  0c  +  tg\  • 

A  remarkable  fact  is  that  this  number  does  not  depend  on  the  bandwidth  of  the  matrix.  However, 
any  complexity  result  based  on  this  observation  is  deceiving  since  the  number  of  processors  required 
to  achieve  this  time  increases  like  the  square  of  the  half-bandwidth.  It  is  unlikely  that  the  problems 
encountered  in  nature  will  exactly  fit  some  priviledged  number  k  of  processors  of  a  given  config¬ 
uration.  A  less  unrealistic  goal  is  to  solve  any  problem  by  choosing  a  solution  method  according 
to  various  parameters  of  that  problem,  such  as  its  size  and  its  bandwidth  in  the  present  case.  For 
example,  if  the  number  of  processors  is  much  larger  than  v3,  one  might  use  a  cyclic  reduction  type 
algorithm  [4],  or  a  substructured  Gaussian  elimination  method  [1]  which  axe  not  considered  here. 

As  a  consequence  we  now  assume  that  the  number  of  processors  is  less  than  v2.  The  above 
algorithm  can  be  generalized  by  simply  partitioning  the  matrix  in  blocks  of  size  £  each.  The  block 
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Figure  8:  Wavefront  algorithm  for  a  banded  linear  system. 


matrix  resulting  from  this  partitioning  is  of  block-^ize  N({v/k)  =  The  adaptation  of  the 
wavefront  method  is  straightforward.  The  multipliers,  now  matrices  of  size  require  multiplying 
the  inverse  of  a  matrix  by  another  matrix,  in  each  wavefront  step.  Likewise,  the  typical  operation 
(6.1)  becomes  a  matrix  operation.  It  can  be  seen  that  each  step  is  dominated  by  the  cost  of  the 
matrix  multiplication,  i.e.,  by  the  time  (jj)2(tt  +  f")-  Each  step  of  communication  on  the  other 
hand  requires  a  time  of  0g  +  (£)2  tg-  Summing  up  over  the  ZNk/v  -  2  steps  of  the  algorithm,  we 
find  an  approximate  total  time  of 

tw 'Block  *  [Q2  b  +  V-u)  +  0G  +  02  r<y] 

or, 

tw,Biock  &  3JV  [-  (7  +  —J)  4-  -0c  +  —tg\  • 

A  quick  comparison  with  the  ring  and  the  grid  algorithms  of  the  previous  sections  indicates  that  for 
small  values  of  k,  the  wavefront  algorithm  will  be  about  three  times  slower.  However,  the  wavefront 
algorithm  allows  us  to  increase  the  number  of  processors  to  i/2  and  still  attain  linear  speed-up. 

It  is  important  to  say  a  few  words  about  the  solution  of  the  resulting  triangular  system.  A 
wavefront-like  method  has  been  devised  for  solving  (dense)  triangular  systems  in  [3].  The  method 
called  TCB/G  (Triangular  system  solution  by  Column  -  Blocks/Greedy)  consists  of  sweeping  from 
bottom  to  top  by  marching  along  skew  diagonals.  The  method  can  be  easily  adapted  to  banded 
linear  systems  with  essentially  no  modification  to  it,  except  for  the  fact  that  the  processors  are  now 
assigned  in  a  manner  that  accounts  for  bandedness.  In  contrast  with  the  LU  factorization  there 
is  only  one  wave  moving  and  the  total  number  of  scalar  steps  is  2 N  —  1.  Therefore,  the  time  for 
solving  the  triangular  system  by  this  wavefront  method  is  comparable  with  the  time  to  perform 
the  Gaussian  reduction  to  triangular  form.  The  obvious  reason  is  that  processors  are  idle  during 
most  of  the  algorithm. 

7.  Summary  of  results  and  conclusion 

The  complexity  results  of  four  Gaussian  elimination  algorithms  are  summarized  in  Table  1. 
The  optimal  time  for  the  ring  shown  in  the  table  is  in  reality  the  upper  bound  given  by  (3.13).  For 
Algorithm  HBGE  the  optimal  time  is  difficult  to  compute.  The  time  shown  in  the  table  is  an  upper 
bound  obtained  by  letting  k  =  */,  i.e.,  by  using  the  maximun  possible  number  of  processors.  This 
will  be  nearly  optimal  in  general  as  the  total  time  is  the  sum  of  a  decreasing  function  of  k  and  a 
logarithmicly  increasing  function  of  k. 

To  conclude,  we  can  make  the  following  observations: 

•  The  time  for  performing  arithmetic  is  the  same  for  the  first  three  methods  and  three  times  as 
long  for  the  fourth  method  for  v  large  enough. 

•  The  time  for  communication  shows  a  decrease  by  a  factor  of  \fk  when  passing  from  a  linear  ring 
to  a  grid.  This  interesting  property  is  due  to  the  fact  that  the  bandwidth  in  the  vertical  and 
horizontal  directions  is  multiplied  by  \fk.  Also  latency  times  are  reduced  because  the  diameter 
of  the  grid  is  of  the  order  of  \fk  times  smaller  than  that  of  the  ring. 

•  For  the  hypercube  topology  the  latency  term  is  of  the  form  N0g^O92^-  The  consequence  of  this 
is  that  the  deterioration  of  efficiency  due  to  communication  overhead  grows  much  more  slowly 
than  that  of  the  ring  or  the  grid. 

•  For  a  large  number  of  processors,  in  all  methods  which  do  not  use  pipelining  (i.e.,  the  first 
three  methods)  the  degradation  in  performance  due  to  communication  overhead  is  important 
and  the  wavefront  approach  is  to  be  preferred,  as  is  reflected  by  the  last  column  of  the  table. 
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Note  :  a  —  where  r,0  are  either  tr,0r  (ring)  or  tg,0g  (grid),  or  tr,0r  (cube). 

Ihble  1:  Timings  for  parallel  Gaussian  elimination  algorithms 

•  For  small  number  of  processors  the  wavefront  method  does  not  perform  as  well  as  the  simpler 
nonpipelined  methods. 

•  The  optimal  times  in  which  a  banded  linear  system  can  be  solved  by  the  nonpipelined  Gaussian 
elimination,  if  we  had  an  arbitrarily  large  number  of  processors,  is  of  the  form  O(Nu)  for  the 
ring,  0(JVi/2/3)  for  the  2-grid  and  0(Nlogi(u))  for  the  cube.  Comparing  with  the  0(Ni/2)  time 
of  a  single  processor,  we  observe  that  the  important  change  takes  place  when  passing  from  the 
2-D  grid  architecture  to  the  hypercube  architecture. 
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